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ABSTRACT 


A  new  algorithm  for  solving  FEM  problems  is  presented.  It  blends  a 
preconditioned  conjugate  gradient  iteration  into  a  direct  factorization 
method.  The  goal  was  to  reduce  fill  to  a  negligible  level  and  thus  reduce 
storage  requirements  but  it  turned  out  to  be  faster  than  its  rivals  for  an 
importent  class  of  problems. 
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1 .  INTRODUCTION 


The  numerical  method  proposed  here  differs  in  a  single  but  critical 
aspect  from  the  one-way  dissection  algorithm  described  by  Alan  George 
in  [George,  1900].  We  have  been  using  a  dissection  procedure  for  irregular 
domains  which  is  close  to  his  and  so  we  can  simplify  our  report  by  focussing 
on  the  point  of  difference  and  its  consequences. 

The  problem  we  both  address  is  the  solution  of  an  N  by  N  linear  system 
Ax  =  b  arising  in  the  application  of  finite  element  methods.  In  particular 
we  assume  that  A  is  symmetric,  positive  definite,  and  sparse. 

Although  nested  dissection  schemes  are,  in  various  ways,  asymptotically 
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optimal  their  preeminence  is  not  apparent  for  values  of  N  as  small  as  10  . 
Consequently  one-way  dissection  ordering  schemes  are  also  worthy  of  study. 

The  use  of  one-way  schemes  is  illustrated  by  considering  block  2  by  2 
matrices 


A  = 


with  A^^ 


1,2. 


In  the  course  of  solving  Ax  =  b  by  a  direct  method  we  must  solve,  explicitly 
or  implicitly,  a  reduced  system  involving  the  Gauss  transform,  namely 


(1) 


A  -i  -r 

^22  ”  ^22  *  ^  ^1 1  ^ 


i 


Recently  George  and  Liu  [Computer  Solution  of  Large  Sparse  Positive  Definite 
Systems,  1981]  have  shown  a  clever  way  of  computing  the  Cholesky  factor  1.2 
of  A22  without  utilizing  any  storage  beyond  E  and  L-j  ,  the  Cholesky  factor 
of  ^1^  .  Of  course  L2  will  be  denser  than  A22- 

Our  preference  however,  is  to  solve  the  subsystem  (1)  iteratively,  using 
preconditioned  conjugate  gradients  (CG),  without  ever  forming  A22  explicitly. 
Our  method  requires  storage  for  L.|,  L2  and  E  .  Convergence  of  the  iteration 
is  governed  by  a  matrix 

T  -1  -T 

W  =  I  -  C  C  ,  where  C  =  L2  E  L.j  , 

as  described  in  the  next  section. 

There  are  node  orderings  in  which  A.j.|  and  A22  are  dense  banded 
matrices  and  in  these  cases  there  is  no  fill  at  all.  On  other  occasions  it 
pays  to  allow  a  little  fill  in  and  L2  ,  within  the  band.  We  follow 
George  in  using  the  word  fill  rather  than  fill-in. 

Although  in  general  we  are  prepared  to  sacrifice  some  execution  time  in 
order  to  economize  on  storage  we  have  not  had  to  give  up  any  speed  on  any  of 
the  problems  we  have  tackled  so  far.  On  the  contrary,  our  method  is  signifi¬ 
cantly  quicker  than  its  obvious  rivals.  Details  are  in  Section  3. 

At  this  point  we  must  say  a  word  about  the  "sin"  of  invoking  an  iteration 


in  the  middle  of  a  respectable  direct  method.  We  have  been  so  impressed  by 
all  the  effort  and  ingenuity  which  has  been  expended  on  reducing  fill  during 


Gaussian  elimination  that  we  succumbed  to  the  temptation  to  get  rid  of 
fill  either  completely,  or  almost  completely. 

It  will  occur  to  the  reader  that  a  more  natural  way  to  avoid  fill  is  to 
solve  the  original  system  Ax  =  b  by  an. iterative  technique  such  as  CG. 

Yet  even  when  the  system  is  preconditioned  by  the  matrix  diag  (A.|^  ,  A22) 

CG  is  at  least  twice  as  slow  as  our  hybrid  method. 

Of  course,  a  direct  method  will  be  quicker  when  there  are  many  right 
hand  sides.  When  there  are  few  then  the  hybrid  method  is  comparable  in 
arithmetic  work  besides  demanding  less  storage. 

The  success  of  our  approach  depends  on  the  extent  to  which  A22  'is  a 

good  preconditioner  for  A22  •  In  Section  2  we  analyze  the  model  problem, 

in  part  because  the  analysis  can  be  done  quite  swiftly  with  pencil  and  paper. 

We  show  that  even  simple-minded  orderings  yield  lower  arithmetic  costs, 

0  (N^‘^)  ,  than  comparable  direct  methods ,  namely  for  one-way 
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dissection  and  0(N  )  for  profile  solvers.  In  Section  3  we  show  the  results 
of  several  methods  on  a  few  realistic  problems,  including  the  deflection 
of  a  folded  plate  structure  which  poses  a  severe  test  for  our  method  because 
of  its  large  condition  number.  In  Section  4  we  discuss  the  efficiency  of 
various  methods. 

A  novel  feature  of  the  hybrid  approach  is  that  it  leads  to  the  search 
for  orderings  which  keeps  A  dense  around  the  main  diagonal,  neither  small 
envelope  nor  narrow  bandwidth  are  important  attributes  in  this  approach. 


It  has  not  escpaed  our  attention  that  it  is  possible  to  use  a  somewhat 


Kk 


more  powerful  but  also  more  complicated  preconditioner  than  ^22  for  solving 
A22  X2  =  C2  .  Both  this  possibility  and  the  application  of  these  ideas  to 
3-D  problems  are  under  investigation. 

2.  APPLICATION  TO  THE  MODEL  PROBLEM 

Consider  the  5  point  Laplacian  operator  acting  on  a  regular  square 
2 

grid  with  m  unknowns  and  h  =  l/(m+l)  .  There  exist  very  efficient  special 
methods  for  solving  this  special  problem  -a^  u  =  f  and  the  only  reasons  for 
considering  it  here  are  (1)  we  can  obtain  the  convergence  rate  analytically 
and  (2)  we  get  an  understanding  of  how  our  algorithm  behaves. 

In  order  to  keep  the  analysis  simple  we  use  a  less  than  optimal  ordering 
in  which  all  the  nodes  in  the  odd  columns  are  numbered  before  the  nodes  in 
the  even  columns,  as  shown  in  Fig.  1. 

The  matrix  representation  of  -A^  which  arises  from  this  ordering  is 
shown  below. 


Here  D  is  a  m  x  m  tridiagonal  Toeplitz  matrix  with  nonzero  elements 
(-1  ,  4,  -1).  The  condition  number  of  A  is  known. 


Cond  A  = 


1  +  cos  Trh 
1  -  cos  Trh 


h  =  l/(m  +  1). 


2 

■  cot 


(’fh/2) 


for  small  h  . 


Let  the  Cholesky  factor  of  be  Lj  ;Ajj  =  L^lT  ,  j  =  1,2.  Note  that 
there  is  no  fill  when  forming  L.  .  Our  algorithm  uses  the  preconditioned 
conjugate  gradient  iteration  to  solve  equations  with  coefficient  matrix 


A22  -  E  A 


11  ^ 


The  matrix  which  governs  the  iteration  is 

W  =  I  -  (L'^  E  L'^  )  (L‘^  e"^  ) 

and  the  convergence  rate  may  be  bounded  in  terms  of  cond  (W).  Next  we 
determine  the  largest  and  smallest  eigenvalues  of  W  in  the  easy  case  when 
All  "  ^22  "  ^  even,  and  L.|  =  L2  =  L  . 

Note  that  W  is  similar  to  another  symmetric  matrix  W  =  L  W  L’^  = 

I  -  E  H  ^  E^  .  On  writing  out  E  E^  it  is  seen  to  be  a  block 
tri diagonal  Toeplitz  matrix  with  nonzero  blocks 

(  D*^  ,  20'^  ,  D*^  ) 


except  for  the  last  block  row  which  is 


(0,  ...  ,  0,  D‘^  ,  D'^). 


_2 

This  .Tiatrix  is  a  Kronecker  (or  tensor  )  product  T  ®  D  where  T  is  the 
iti/2  by  m/2  tri diagonal  Toeplitz  matrix  with  nonzero  elements  (1,  2,  1), 
except  for  its  last  row  which  is  (0,  ...  ,  0,  1,  1). 

The  eigenvalues  of  T  and  D  are  well  known.  See  [Gregory  and  Karney, 
p.  137]  to  verify  the  following,  using  h  =  l/(m  +  1)  , 

(T)  =  2  (1  +  cos  2k-Thl  ,  k  =  1,  ...  ,  m/2  , 

\.  (D"^)  =  (4  -  2  cos  jirh)"^  .  j  =  1,  •••  ,  rn  . 

0 

-2  2 

The  eigenvalues  of  T  ®D  are  the  m  /2  products  of  eigenvalues  of  T 

-2  2 

and  eigenvalues  of  D  .  Using  cos  mh  =  -cos  rrh  =  2  sin  ih/2  -  1  we  find 

{T®D  ^)  =  4  sin^  (  ^  )  ’  [6  (1  -  -j  sin^  — ^  )  ]  ^ 

1  2 

t  (irh)  ,  for  small  h  . 

W  4  cos^  (^h)  •  [2  (1  +  2  sin^  ^  )  ]'^ 

#  (1  -  TT^h^)  (1  -  j^h^)  ,  for  small  h  . 


In  conclusion 

cond  (W)  =  ^  -  ^.min  (T®D'^i 

1  '  ^max  (T®D-2) 

i  ^  ~  36 

2(nh)^ 


In  other  words 


%  1/2  (ith)^  ,  for  small  h 


cond  (W)  =  cond  (A)/8  . 

Perhaps  this  last  comparison  is  unfair.  If  preconditioning  is  applied 
to  the  Gauss  Transform  matrix  ^22  ~  ^  should  also  be  applied  to  A 

It  gives  the  same  convergence  rate  as  ordinary  CG  applied  to  the  matrix 


I  CT 


where  C  =  L*^  E  L'^  and  LL^  =  A^.|  =  A22 


To  each  eigenvalue  A  of  CC  ,  determined  above,  correspond  eigenvalues 
±  /a  of  A  -  I  .  It  follows  that 


1  +  /  A„^  (T®  D--) 

cond  (A)  =  - ^ - 

1 


^  2  -  i  2 


,  when  h  is  small 


Thus,  for  small  h. 


cond  (W)  f  cond  (A)  /  4  =  cond  (A)  /  8  . 


The  condition  number  yields  a  crude  upper  bound  on  the  convergence  rate 
of  CG.  The  standard  theory,  in  [Hestenes  and  Stiefel ,  1952],  asserts  that 


when  CG  is  applied  in  exact  arithmetic  using  a  positive  definite  matrix  M 


then  the  energy  norm  of  the  residual  r  ,  i.e.,  >■  r^  r  ,  is  reduced 
after  k  steps  by  at  least  a  factor  of 


2  (  f  cond  (mF  -  1 
/  cond  (M)  +  1 

However  for  k  >  order  (M)/2  this  bound  is  much  too  big.  In  our  case  for 
small  h  and  modest  k  the  reduction  factor  per  step  is  at  least 

1  -  rrh  ,  for  A  , 

1  -  '^hvT  ,  for  A  , 

1  -  -h2vr  ,  for  W  . 


In  other  words  CG  with  W  requires  half  the  number  of  steps  used  by 
CG  with  A  .  One  step  with  W  requires  less  work  than  one  step  with  A 
and  so  our  hybrid  method  is  more  efficient  than  "pure"  preconditioned  CG 
on  A  for  the  model  problem.  Our  tests  bear  this  out. 


The  standard  theory  given  above  for  estimating  convergence  is  exact 
only  for  the  Chebyshev  -  like  distribution  of  eigenvalues  of  A  .  In  the 
model  problem  the  eigenvalues  are  known  and  more  refined  bounds  can  be 
obtained.  For  example  after  k  steps  the  residual  norm  will  have  been 
reduced  by  at  least 


(>„-  'll  >2)  *3> 

(a^  -  0  )  (X^  -  0  )  (\3  -  0  ) 


< 


1 


-  M’ 


<  X. 


The  eigenvalues  A^.  satisfy 


0  <  Aj  <  X^ 


:  1—.  (1  .  4,'2  . 

50  (’Th)^ 

Here  is  the  Chebyshev  polynomial  of  degree  k  . 

Table  1  gives  the  predicted  number  of  steps  Sp  and  the  actual  number 
of  steps  Sa  needed  to  reduce  the  residual  norm  to  lO”^  of  its  original 
value.  The  refined  bounds  (using  42.  ''2*  '4)  given  in  parenthesis. 
Recall  that  m  +  1  =  1/h  .  An  unsymmetric  load  was  used  for  the  right  hand 
side. 


TABLE  1 


m  +  1 

m  =  m^ 

Cond  (W) 

Sa 

20 

361 

20.79 

31  (19) 

19 

40 

1521 

81.58 

63  (37) 

36 

80 

6241 

324.8 

124  (77) 

66 

We  turn  next  to  direct  methods.  These  require  more  storage  and  have 
significant  advantages  when  many  right  hand  sides  are  given.  Here  we  con¬ 
sider  only  the  execution  costs  for  one  right  hand  side. 

Our  numerical  tests  showed  that  the  standard  profile  or  envelope  schemes, 
found  so  often  in  finite  element  packages,  were  significantly  slower.  The 
greater  the  value  of  N  the  slower  they  were  relative  to  N  .  The  reason 
is  simple.  Envelope  methods  require  0(N  )  operations  On  the  other  hand 

the  hybrid  scheme  needs  0(iN^"^)  ;  each  step  requires  0(N)  operations 

. . .  ..  . - -  . 


and  the  number  of  iterations  required  to  converge  to  working  precision  is 
0(h'^)  =  0(m)  =  O(N^)  for  the  model  problem. 

Of  course,  the  coefficients  of  these  leading  terms  are  needed  to  complete 
the  picture.  Our  tests  suggest  that  for  direct  and  hybrid  methods  the 
coefficients  are  the  same  order  of  magnitude.  Even  for  m  =  20  the  hybrid 
method  was  twice  as  fast  as  the  envelope  (or  profile)  Cholesky  factorization 
algorithm. 

We  have  not  compared  our  method  with  George's  One  Way  Dissection  algorithm 

on  the  model  problem  but  we  expect  his  to  be  halfway  between  ours  and  the 

envelope  algorithm  because  George  has  shown  that  the  number  of  operations 
1  75 

is  0(N  ‘  )  for  the  model  problem. 

We  have  not  given  precise  comparisons  because  we  do  not  think  that  too 
much  weight  should  be  given  to  the  model  problem.  It  is  useful  in  giving 
insight  into  the  algorithms  but  it  is  no  benchmark. 

In  the  next  section  we  take  up  more  realistic  applications. 


3.  NUMERICAL  EXAMPLES 

In  this  section  we  compare  the  hybrid  method  (called  H)  described  earlier, 
with  the  one-way  dissection  method  (called  IWD).  The  test  problems  used  for 
this  comparison  are  typical  of  those  found  in  finite  element  structural 
analysis.  More  detailed  illustrations  of  these  problems  are  provided  in  Figs. 
2  and  3  .  We  are  aware  that  a  certain  method  may  perform  better  than  all  others 


when  applied  to  a  narrow  class  of  problems,  and  a  truer  evaluation  of  the 
method  can  only  come  from  a  much  wider  bed  of  test  problems.  We  will  try 
to  point  out  any  bias  in  our  test  problems.  In  any  case,  they  do  represent 
an  area  where  sparse  techniques  are  applied  extensively. 

In  the  following  tables  an  'operation'  means  a  multiplication  followed 
by  an  addition.  Comparisons  between  execution  times  were  avoided  because 
these  times  depend  very  much  on  the  implementations  of  the  algorithms  and  some¬ 
what  on  the  computer  system  that  is  used. 

The  first  set  of  results  was  obtained  by  applying  the  two  methods  (IWD  &  H) 
to  the  following: 

(i)  A  plane  stress  problem  producing  a  matrix  of  the  form 
shown  in  Fig.  2.  This  involves  Poisson's  equation 

-  Au  =  f 

with  both  Dirichlet  and  Neumann  bounary  conditions. 

(ii)  The  fourth  order  bi harmonic  equation 

A^u  =  f 

inside  a  rectangular  domain  with  a  similar  mixed 
boundary  condition.  This  corresponds  to  a  partially 
clamped  plate  problem. 

(iii)  A  folded  plate  structure  is  a  more  practical  one  and 
we  feel  this  problem  is  a  crucial  test  of  our  method. 


These  problems  have  a  very  large  condition  number  and 
traditionally  iterative  methods  have  not  been  used 
because  of  their  depressing  rates  of  convergence. 

The  domain  was  discretized  using  a  finite  element  mesh.  Three  different 
node  orderings  were  applied  to  the  mesh  and  the  resulting  system  of  equations 
was  solved,  for  each  ordering,  using  the  above  methods.  In  [2]  Alan  George 
derived  an  expression  for  the  ratio  order  :  order  (A22)  •  Based  on 

this,  we  found  that  the  optimum  ratio  in  (i)  and  (ii)  for  IWD  and  H  is  about 
3:1.  In  (i)  and  (ii)  the  right  hand  side  of  the  equation  (load  vector)  was 
physically  symmetric.  This  results  in  a  somewhat  faster  convergence  rate 
than  for  a  general  right  handside.  An  unsymmetric  load  vector  increases  the 
number  of  iterations  by  about  30%. 

In  all  the  test  problems  the  tolerance  on  the  residual  norm  was  set  at 
computer  precision  and  the  initial  vector  was  chosen  as  a  zero  vector.  Of 
course  for  a  lower  tolerance  and  a  better  initial  vector  the  number  of  itera¬ 
tions  will  be  smaller. 


Storage 

1  5 

The  hybrid  method  H  requires  a  total  of  -^-NZA  +  (^  +  4n)N  storage 
cells  when  conjugate  gradient  is  used  for  the  iterative  part.  NZA  is  the 
total  number  of  non-zero  terms  in  A  and  n  =  order  (A22)/order  (A)  and  for 
the  ordering  described  in  Ref.  (1),  0  _<  n  <_  ^  .  Therefore  the  hybrid 
method  will  require  fewer  than  j  (NZA  +  7N)  storage  cells.  This  compares 
with  ^  (NZA  +  11  N)  for  the  conjugate  gradient  method  with  no  preconditioning. 
Thus  there  is  some  justice  in  saying  that  H  gets  by  with  minimal  storage. 


Arithmetic  Work 


Although  the  number  of  iterations  required  to  converge  is  problem 
dependent  (it  depends  on  the  eigenvalue  distribution  of  the  matrix),  the 
examples  presented  here  demonstrate  a  substantial  saving  in  operation  counts 
and  we  feel  this  will  be  true  for  a  very  large  class  of  finite  element  problems 


Table  (2)  -  Comparison  of  the  Direct  and  Hybrid  Method  for  the  Plane 
Stress  Problem  (N  =  220) 


1 

Direct  Method 

IWD 

Hybrid  Method 

H 

no.  of  oper. 

X  103 

storage 

no.  of  its. 

no.  of  oper. 

X  103 

storage 

702 

4982 

30 

136 

2396 

351 

3432 

23 

100 

2264 

1/4 

216 

2810 

19 

84.4 

2278 

profile 

85.8 

(50.8) 

5382 

(4726) 

-- 

1 

1 

1 

1 

Table  (3)  -  Comparison  of  the  Direct  and  Hybrid  Method  for  the 
Plate  Problem  (N  =  330) 


Direct  1 
IWD 

Method 

Hybrid  Method 

H 

n 

no.  of  Oper.  j 

X  103 

storage 

no.  of  its. 

no.  of  oper. 

X  103 

storage 

— 

2375 

11027 

45 

■■1 

4890 

1185 

7585 

35 

4698 

■B 

730 

6240 

26 

4779 

profile 

■m 

1 

1 

1 

1 

-- 

For  the  profile  method  the  numbers  within  parenthesis  are  exact  numbers 
obtained  through  computations. 


1 


I 


Table  (4)  -  Comparison  of  the  Direct  and  Hybrid  Method  for  the 
Folded  Plate  Structure  (N  =  1109) 


Direct  Method 

IWO 

Hybrid  Method 

H 

n 

no.  of  oper. 

X  106 

storage 

no.  of 

its. 

no.  of  oper. 

X  lOS 

storage 

1/2 

90.0 

90440 

209 

11.2 

29600 

1/5 

27.3 

44300 

130 

7.0 

29600 

profile 

14.0 

110900 

1 

- 

— 

4.  EFFICIENCY 

The  step  counts  for  the  model  problem,  given  in  Table  1,  are  greater 

than  they  need  be.  For  the  sake  of  an  elegant  analysis  we  took  the  ratio 
p  =  order  (A^^)  ;  order  {A^2^  ratio  depends  on  the  number 

of  separators  used  in  the  node  ordering  phase  which  is  described  in  [2]. 

For  a  reactangular  domain  a  typical  choice  might  designate  columns  4,  8,  12,  .. 
as  separators  with  the  result  that 

_  order  (Ap?)  _  1 

N  ^ 

There  would  be  a  little  fill  in  the  Cholesky  factors  L-j  and  L2  .  whose 
half-band  widths  enlarge  to  4,  but  the  cost  of  each  step  in  the  iteration  is 
essentially  independent  of  n  .  Thus  the  operation  count  depends  on  S  , 
the  number  of  steps,  which  in  turn  depends  on  cond  (W)  .  We  do  not  have 
a  formula  for  cond  (W)  when  ^  but  empirically  we  find  that 

cond  (W)  n  .  Thus  to  compare  two  schemes,  corresponding  to  n  and  n'  , 


On  the  model  problem  the  step  counts  in  Table  1  were  indeed  reduced  by 
factors  of  /2/3  and  /2/4  when  we  tried  n  =  1/3  and  n  =  1/4  . 

What  is  interesting  is  that  we  found  the  same  reduction  for  the  same  n 
values  on  the  4th  order  problem  of  Section  3.  We  have  no  explanation  for 
this  yet. 

Our  experience  with  our  hybrid  method  is  limited  but  in  order  to 
sharpen  discussion  we  will  make  some  comparisons,  at  least  for  2-D  problems. 
Profile  (or  envelope)  solvers  are  very  attractive  for  small  problems 
(say  200  equations)  and  they  can  be  included  in  our  hybrid  program  by 
choosing  n  =  0  (no  separators).  Of  course  the  small  operation  count  is 

offset  by  considerable  storage  demands.  For  the  model  problem  the  hybrid 
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scheme  needs  0(N  ’  )  operations  while  profile  solvers  need  0(N  )  operations. 

The  break  even  point  appears  to  be  close  to  M  =  600.  Consequently  we  choose 

this  value  as  the  upper  limit  for  small  problems. 

Our  ratings  are  given  in  Table  5  which  is  based  on  the  results,  in 
Table  3,  for  the  partially  clamped  plate  problem. 


Table  5.  -  Rating  of  Methods  for  2-D  Problems  on  Souare  Domains 
with  Regular  Meshes.  1  =  Best.  NE  =  number  of  equations. 


Comments  on  Table  5 

1.  For  a  tree-like  domain  the  hybrid  and  one  way  dissection  methods  win 
over  the  profile  method  much  sooner. 

2.  Profile  solvers  are  good  for  long  narrow  domains, 

3.  Hybrid  always  dominates  one-way  dissection. 
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(c)  3:1  node  ordering 


Fig.  2  The  Regular  Rectangular  Mesh  Used  in  the  Plane  Stress  and  Plate  Problems 
with  Different  Node  Ordering  and  the  Resulting  Matrix  Structures. 


Fig.  3(a)  Different  Matrix  Structures  Induced  by  Two  Different  Node  Orderings 
of  the  Folded  Plate  Structure  Below. 


A,B  encastered  end 


Fig.  3(b)  Finite  Element  Discretization  of  the  Folded  Plate  Structure. 
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